Abstract

Learning how to rank multivariate unlabeled observations depending on their degree of abnormality/novelty is a crucial problem in a wide range of applications. In practice, it generally consists in building a real valued "scoring" function on the feature space so as to quantify to which extent observations should be considered as abnormal. In the 1-d situation, measurements are generally considered as "abnormal" when they are remote from central measures such as the mean or the median. Anomaly detection then relies on tail analysis of the variable of interest. Extensions to the multivariate setting are far from straightforward and it is precisely the main purpose of this paper to introduce a novel and convenient (functional) criterion for measuring the performance of a scoring function regarding the anomaly ranking task, referred to as the Excess-Mass curve (EM curve). In addition, an adaptive algorithm for building a scoring function based on unlabeled data $X_1, \ldots, X_n$ with a nearly optimal EM is proposed and is analyzed from a statistical perspective.


Appearing in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, San Diego, CA, USA. JMLR: W&CP volume 38. Copyright 2015 by the authors.

1 Introduction

In a great variety of applications (e.g. fraud detection, distributed fleet monitoring, system management in data centers), it is of crucial importance to address anomaly/novelty issues from a ranking point of view. In contrast to novelty/anomaly detection (e.g. [4, 13, 10, 12]), novelty/anomaly ranking is very poorly documented in the statistical learning literature (see [14] for instance). However, when confronted with massive data, being able to rank observations according to their supposed degree of abnormality may significantly improve operational processes and allow for a prioritization of actions to be taken, especially in situations where the human expertise required to check each observation is time-consuming. When univariate, observations are usually considered as "abnormal" when they are either too high or too small compared to central measures such as the mean or the median. In this context, anomaly/novelty analysis generally relies on the analysis of the tail distribution of the variable of interest. No natural (pre-)order exists on a $d$-dimensional feature space, $\mathcal{X} \subset \mathbb{R}^d$ say, as soon as $d > 1$. Extension to the multivariate setup is thus far from obvious and, in practice, the optimal ordering/ranking must be learned from training data $X_1, \ldots, X_n$, in absence of any parametric assumptions on the underlying probability distribution describing the "normal" regime. The most straightforward manner to define a preorder on the feature space $\mathcal{X}$ is to transport the natural order on the real half-line through a measurable scoring function $s : \mathcal{X} \to \mathbb{R}_+$: the "smaller" the score $s(X)$, the more "abnormal" the observation $X$ is viewed. Any scoring function defines a preorder on $\mathcal{X}$ and thus a ranking on a set of new observations. An important issue thus concerns the definition of an adequate performance criterion, $C(s)$ say, in order to compare possible candidate scoring functions and to pick one eventually: optimal scoring functions $s^*$ being then defined as those optimizing $C$.
Throughout the present article, it is assumed that the distribution $F$ of the observable r.v. $X$ is absolutely continuous w.r.t. Lebesgue measure Leb on $\mathcal{X}$, with density $f(x)$. The criterion should thus be defined in such a way that the collection of level sets of an optimal scoring function $s^*(x)$ coincides with that related to $f$. In other words, any nondecreasing transform of the density should be optimal regarding the ranking performance criterion $C$. According to the Empirical Risk Minimization (ERM) paradigm, a scoring function will be built in practice by optimizing an empirical version $C_n(s)$ of the criterion over an adequate set of scoring functions $\mathcal{S}_0$ of controlled complexity (e.g. a major class of finite VC dimension). Hence, another desirable property to guarantee the universal consistency of ERM learning strategies is the uniform convergence of $C_n(s)$ to $C(s)$ over such collections $\mathcal{S}_0$ under minimal assumptions on the distribution $F(dx)$.
In [1, 2], a functional criterion referred to as the Mass-Volume (MV) curve, admissible with respect to the requirements listed above, has been introduced, extending somehow the concept of ROC curve in the unsupervised setup. Relying on the theory of minimum volume sets (see e.g. [8, 11] and the references therein), it has been proved that the scoring functions minimizing empirical and discretized versions of the MV curve criterion are accurate when the underlying distribution has compact support, and a first algorithm for building nearly optimal scoring functions, based on the estimate of a finite collection of properly chosen minimum volume sets, has been introduced and analyzed. However, by construction, learning rate bounds are rather slow (of the order $n^{-1/4}$ namely) and cannot be established in the unbounded support situation, unless very restrictive assumptions are made on the tail behavior of $F(dx)$. See Figure 3 and related comments for an insight into the gain resulting from the concept introduced in the present paper in contrast to the MV curve minimization approach.

Given these limitations, it is the major goal of this paper to propose an alternative criterion for anomaly ranking/scoring, called the Excess-Mass curve (EM curve in short) here, based on the notion of density contour clusters [7, 3, 6]. Whereas minimum volume sets are solutions of volume minimization problems under mass constraints, the latter are solutions of mass maximization under volume constraints. Exchanging this way objective and constraint, the relevance of this performance measure is thoroughly discussed and the accuracy of solutions which optimize statistical counterparts of this criterion is investigated. More specifically, rate bounds of the order $n^{-1/2}$ are proved, even in the case of unbounded support. Additionally, in contrast to the analysis carried out in [1], the model bias issue is tackled, insofar as the assumption that the level sets of the underlying density $f(x)$ belong to the class of sets used to build the scoring function is relaxed here.

The rest of this paper is organized as follows. Section 3 introduces the notion of EM curve and that of optimal EM curve. Estimation in the compact support case is covered by section 4; extension to distributions with non-compact support and control of the model bias are tackled in section 5. A simulation study is performed in section 6. All proofs are deferred to the Appendix section.

2 Background and related work 

We first provide a brief overview of the scoring approach based on the MV curve criterion, as a basis for comparison with that promoted in the present paper.

Here and throughout, the indicator function of any event $\mathcal{E}$ is denoted by $\mathbb{1}_{\mathcal{E}}$, the Dirac mass at any point $x$ by $\delta_x$, $A \Delta B$ the symmetric difference between two sets $A$ and $B$, and by $\mathcal{S}$ the set of all scoring functions $s : \mathcal{X} \to \mathbb{R}_+$ integrable w.r.t. Lebesgue measure. Let $s \in \mathcal{S}$. As defined in [1, 2], the MV-curve of $s$ is the plot of the mapping $\alpha \in (0,1) \mapsto MV_s(\alpha) = \lambda_s \circ \alpha_s^{-1}(\alpha)$, where $\alpha_s(t) = \mathbb{P}(s(X) \ge t)$, $\lambda_s(t) = Leb(\{x \in \mathcal{X}, s(x) \ge t\})$ and $H^{-1}$ denotes the pseudo-inverse of any cdf $H : \mathbb{R} \to (0,1)$. This induces a partial ordering on the set of all scoring functions: $s$ is preferred to $s'$ if $MV_s(\alpha) \le MV_{s'}(\alpha)$ for all $\alpha \in (0,1)$. One may show that $MV^*(\alpha) \le MV_s(\alpha)$ for all $\alpha \in (0,1)$ and any scoring function $s$, where $MV^*(\alpha)$ is the optimal value of the constrained minimization problem

$$\min_{\Gamma \text{ borelian}} Leb(\Gamma) \quad \text{subject to} \quad \mathbb{P}(X \in \Gamma) \ge \alpha. \tag{1}$$

Suppose now that $F(dx)$ has a density $f(x)$ satisfying the following assumptions:

A1 The density $f$ is bounded, i.e. $\|f(X)\|_\infty < +\infty$.

A2 The density $f$ has no flat parts: $\forall c \ge 0$, $\mathbb{P}\{f(X) = c\} = 0$.

One may then show that the curve $MV^*$ is actually a MV curve, that is related to (any increasing transform of) the density $f$ namely: $MV^* = MV_f$. In addition, the minimization problem (1) has a unique solution $\Gamma^*_\alpha$ of mass $\alpha$ exactly, referred to as minimum volume set (see [8]): $MV^*(\alpha) = Leb(\Gamma^*_\alpha)$ and $F(\Gamma^*_\alpha) = \alpha$. Anomaly scoring can be then viewed as the problem of building a scoring function $s(x)$ based on training data such that $MV_s$ is (nearly) minimum everywhere, i.e. minimizing $\|MV_s - MV^*\|_\infty := \sup_{\alpha \in [0,1]} |MV_s(\alpha) - MV^*(\alpha)|$. Since $F$ is unknown, a minimum volume set estimate $\hat\Gamma^*_\alpha$ can be defined as the solution of (1) when $F$ is replaced by its empirical version $F_n = (1/n)\sum_{i=1}^n \delta_{X_i}$, minimization is restricted to a collection $\mathcal{G}$ of borelian subsets of $\mathcal{X}$ supposed not too complex but rich enough to include all density level sets (or reasonable approximants of the latter), and $\alpha$ is replaced by $\alpha - \phi_n$, where the tolerance parameter $\phi_n$ is a probabilistic upper bound for the supremum $\sup_{\Gamma \in \mathcal{G}} |F_n(\Gamma) - F(\Gamma)|$. Refer to [11] for further details.
The set $\mathcal{G}$ should ideally offer statistical and computational advantages both at the same time, allowing for fast search on the one hand and being sufficiently complex to capture the geometry of target density level sets on the other. In [1], a method consisting in preliminarily estimating a collection of minimum volume sets related to target masses $0 < \alpha_1 < \ldots < \alpha_K < 1$ forming a subdivision of $(0,1)$ based on training data so as to build a scoring function $\hat{s} = \sum_k \mathbb{1}_{x \in \hat\Gamma^*_{\alpha_k}}$ has been proposed and analyzed. Under adequate assumptions (related to $\mathcal{G}$, the perimeter of the $\Gamma^*_{\alpha_k}$'s and the subdivision step in particular) and for an appropriate choice of $K = K_n$, either under the very restrictive assumption that $F(dx)$ is compactly supported or else by restricting the convergence analysis to $[0, 1-\epsilon]$ for $\epsilon > 0$, thus excluding the tail behavior of the distribution $F$ from the scope of the analysis, rate bounds of the order $O_{\mathbb{P}}(n^{-1/4})$ have been established to guarantee the generalization ability of the method.

Figure 3 illustrates the problems inherent to the use of the MV curve as a performance criterion for anomaly scoring in a "non asymptotic" context, due to the prior discretization along the mass-axis. In the 2-d situation described by Fig. 3 for instance, given the training sample and the partition of the feature space depicted, the MV criterion leads to consider the sequence of empirical minimum volume sets $A_1$, $A_1 \cup A_2$, $A_1 \cup A_3$, $A_1 \cup A_2 \cup A_3$ and thus the scoring function $s_1(x) = \mathbb{I}\{x \in A_1\} + \mathbb{I}\{x \in A_1 \cup A_2\} + \mathbb{I}\{x \in A_1 \cup A_3\}$, whereas the scoring function $s_2(x) = \mathbb{I}\{x \in A_1\} + \mathbb{I}\{x \in A_1 \cup A_3\}$ is clearly more accurate.

In this paper, a different functional criterion is proposed, obtained by exchanging objective and constraint functions in (1), and it is shown that optimization of an empirical discretized version of this performance measure yields scoring rules with convergence rates of the order $O_{\mathbb{P}}(1/\sqrt{n})$. In addition, the results can be extended to the situation where the support of the distribution $F$ is not compact.

3 The Excess-Mass curve 

The performance criterion we propose in order to evaluate anomaly scoring accuracy relies on the notion of excess mass and density contour clusters, as introduced in the seminal contribution [7]. The main idea is to consider a Lagrangian formulation of a constrained minimization problem, obtained by exchanging constraint and objective in (1): for $t > 0$,

$$\max_{\Omega \text{ borelian}} \{\mathbb{P}(X \in \Omega) - t\,Leb(\Omega)\}. \tag{2}$$

We denote by $\Omega^*_t$ any solution of this problem. As shall be seen in the subsequent analysis (see Proposition 3 below), compared to the MV curve approach, this formulation offers certain computational and theoretical advantages both at the same time: when letting (a discretized version of) the Lagrangian multiplier $t$ increase from 0 to infinity, one may easily obtain solutions of empirical counterparts of (2) forming a nested sequence of subsets of the feature space, thus avoiding the deterioration of rate bounds that results from transforming the empirical solutions so as to force monotonicity.

Definition 1. (Optimal EM curve) The optimal Excess-Mass curve related to a given probability distribution $F(dx)$ is defined as the plot of the mapping

$$t > 0 \mapsto EM^*(t) := \max_{\Omega \text{ borelian}} \{\mathbb{P}(X \in \Omega) - t\,Leb(\Omega)\}.$$

Equipped with the notation above, we have: $EM^*(t) = \mathbb{P}(X \in \Omega^*_t) - t\,Leb(\Omega^*_t)$ for all $t > 0$.

Notice also that $EM^*(t) = 0$ for any $t > \|f\|_\infty$.



Figure 1: EM curves depending on densities

Figure 2: Comparison between $MV^*(\alpha)$ and $EM^*(t)$

Lemma 1. (On existence and uniqueness) For any subset $\Omega^*_t$ solution of (2), we have

$$\{x, f(x) > t\} \subset \Omega^*_t \subset \{x, f(x) \ge t\} \quad \text{almost-everywhere},$$

and the sets $\{x, f(x) > t\}$ and $\{x, f(x) \ge t\}$ are both solutions of (2). In addition, under assumption A2, the solution is unique:

$$\Omega^*_t = \{x, f(x) > t\} = \{x, f(x) \ge t\}.$$

Observe that the curve $EM^*$ is always well-defined, since $\int_{f \ge t} (f(x) - t)\,dx = \int_{f > t} (f(x) - t)\,dx$. We also point out that $EM^*(t) = \alpha(t) - t\lambda(t)$ for all $t > 0$, where we set $\alpha := \alpha_f$ and $\lambda := \lambda_f$.

Proposition 1. (Derivative and convexity of $EM^*$) Suppose that assumptions A1 and A2 are fulfilled. Then, the mapping $EM^*$ is differentiable and we have for all $t > 0$:

$$EM^{*\prime}(t) = -\lambda(t).$$

In addition, the mapping $t > 0 \mapsto \lambda(t)$ being decreasing, the curve $EM^*$ is convex.
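
As a numerical sanity check of Proposition 1, the sketch below takes the standard normal density in dimension one as an illustrative choice (this density is an assumption made here, it is not used in the paper). For this $f$, the level set $\{f \ge t\}$ is the interval $[-r, r]$ with $r = \sqrt{-2\log(t\sqrt{2\pi})}$, so that $\alpha(t)$ and $\lambda(t)$ are available in closed form and the derivative of $EM^*$ can be compared with $-\lambda(t)$ by finite differences.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)


def level_set_radius(t):
    """Half-width r of {x : f(x) >= t} for the N(0,1) density f (0 if t >= max f)."""
    if t >= 1.0 / SQRT_2PI:
        return 0.0
    return math.sqrt(-2.0 * math.log(t * SQRT_2PI))


def mass(t):
    """alpha(t) = P(f(X) >= t) = P(|X| <= r)."""
    return math.erf(level_set_radius(t) / math.sqrt(2.0))


def volume(t):
    """lambda(t) = Leb({f >= t}) = 2 r."""
    return 2.0 * level_set_radius(t)


def em_star(t):
    """Optimal excess mass EM*(t) = alpha(t) - t * lambda(t)."""
    return mass(t) - t * volume(t)


def em_star_derivative(t, h=1e-6):
    """Central finite-difference approximation of the derivative of EM*."""
    return (em_star(t + h) - em_star(t - h)) / (2.0 * h)


if __name__ == "__main__":
    for t in (0.05, 0.1, 0.2, 0.3):
        print(t, em_star_derivative(t), -volume(t))
```

The two printed columns should agree up to the finite-difference error, and `em_star` vanishes for $t$ above $\|f\|_\infty = 1/\sqrt{2\pi}$.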

We now introduce the concept of Excess-Mass curve of a scoring function $s \in \mathcal{S}$.

Definition 2. (EM curves) The EM curve of $s \in \mathcal{S}$ w.r.t. the probability distribution $F(dx)$ of a random variable $X$ is the plot of the mapping

$$EM_s : t \in [0, \infty[ \mapsto \sup_{A \in \{(\Omega_{s,l})_{l>0}\}} \mathbb{P}(X \in A) - t\,Leb(A), \tag{3}$$

where $\Omega_{s,t} = \{x \in \mathcal{X}, s(x) \ge t\}$ for all $t > 0$. One may also write: $\forall t > 0$, $EM_s(t) = \sup_{u>0} \alpha_s(u) - t\lambda_s(u)$. Finally, under assumption A1, we have $EM_s(t) = 0$ for every $t > \|f\|_\infty$.

Regarding anomaly scoring, the concept of EM curve naturally induces a partial order on the set of all scoring functions: $\forall (s_1, s_2) \in \mathcal{S}^2$, $s_1$ is said to be more accurate than $s_2$ when $\forall t > 0$, $EM_{s_1}(t) \ge EM_{s_2}(t)$. Observe also that the optimal EM curve introduced in Definition 1 is itself the EM curve of a scoring function, the EM curve of any strictly increasing transform of the density $f$ namely: $EM^* = EM_f$. Hence, in the unsupervised framework, optimal scoring functions are those maximizing the EM curve everywhere. In addition, maximizing $EM_s$ can be viewed as recovering a collection of subsets $(\Omega^*_t)_{t>0}$ with maximum mass when penalized by their volume in a linear fashion. An optimal scoring function is then any $s \in \mathcal{S}$ with the $\Omega^*_t$'s as level sets, for instance any scoring function of the form

$$s(x) = \int_{t=0}^{+\infty} \mathbb{1}_{x \in \Omega^*_t}\, a(t)\,dt, \tag{4}$$

with $a(t) > 0$ (observe that $s(x) = f(x)$ for $a \equiv 1$).

Proposition 2. (Nature of anomaly scoring) Let $s \in \mathcal{S}$. The following properties hold true.

(i) The mapping $EM_s$ is non increasing on $(0, +\infty)$, takes its values in $[0,1]$ and satisfies $EM_s(t) \le EM^*(t)$ for all $t \ge 0$.

(ii) For $t \ge 0$, we have: $0 \le EM^*(t) - EM_s(t) \le \|f\|_\infty \inf_{u>0} Leb(\{s > u\} \Delta \{f > t\})$.

(iii) Let $\epsilon > 0$. Suppose that the quantity $\sup_{u > \epsilon} \int_{f^{-1}(\{u\})} 1/\|\nabla f(x)\|\, d\mu(x)$ is bounded, where $\mu$ denotes the $(d-1)$-dimensional Hausdorff measure. Set $\epsilon_1 := \inf_{T \in \mathcal{T}} \|f - T \circ s\|_\infty$, where the infimum is taken over the set $\mathcal{T}$ of all borelian increasing transforms $T : \mathbb{R}_+ \to \mathbb{R}_+$. Then

$$\sup_{t \in [\epsilon + \epsilon_1, \|f\|_\infty]} |EM^*(t) - EM_s(t)| \le C_1 \inf_{T \in \mathcal{T}} \|f - T \circ s\|_\infty,$$

where $C_1 = C(\epsilon_1, f)$ is a constant independent from $s(x)$.

Assertion (ii) provides a control of the pointwise difference between the optimal EM curve and $EM_s$ in terms of the error made when recovering a specific minimum volume set by a level set of $s(x)$. Assertion (iii) reveals that, if a certain increasing transform of a given scoring function $s(x)$ approximates well the density $f(x)$, then $s(x)$ is an accurate scoring function w.r.t. the EM criterion. As the distribution $F(dx)$ is generally unknown, EM curves must be estimated. Let $s \in \mathcal{S}$ and $X_1, \ldots, X_n$ be an i.i.d. sample with common distribution $F(dx)$ and set $\hat\alpha_s(t) = (1/n)\sum_{i=1}^n \mathbb{1}_{s(X_i) \ge t}$. The empirical EM curve of $s$ is then defined as

$$\widehat{EM}_s(t) = \sup_{u > 0} \{\hat\alpha_s(u) - t\lambda_s(u)\}.$$

In practice, it may be difficult to estimate the volume $\lambda_s(u)$ and Monte-Carlo approximation can naturally be used for this purpose.
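
The empirical EM curve with a Monte-Carlo volume estimate can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `empirical_em_curve`, the bounding `box` and the assumption that the score vanishes outside this box (so that the volume of a level set is the box volume times the share of uniform draws falling in it) are all choices made here.

```python
import bisect
import random


def empirical_em_curve(score, sample, box, t_values, n_mc=20000, seed=0):
    """Empirical EM curve of `score` at each t in `t_values`.

    Assumes `score` is zero outside `box`, a list of (low, high) intervals.
    """
    rng = random.Random(seed)
    box_volume = 1.0
    for low, high in box:
        box_volume *= high - low

    data_scores = sorted(score(x) for x in sample)
    mc_scores = sorted(
        score([rng.uniform(low, high) for low, high in box]) for _ in range(n_mc)
    )

    def fraction_at_least(sorted_values, u):
        # Share of values >= u in an ascending list.
        count = len(sorted_values) - bisect.bisect_left(sorted_values, u)
        return count / len(sorted_values)

    # The supremum over u is attained at an observed positive score value,
    # or equals 0 (empty level set).
    thresholds = sorted({u for u in data_scores if u > 0})
    curve = []
    for t in t_values:
        best = 0.0
        for u in thresholds:
            mass = fraction_at_least(data_scores, u)                  # alpha_hat_s(u)
            volume = box_volume * fraction_at_least(mc_scores, u)     # Monte-Carlo lambda_s(u)
            best = max(best, mass - t * volume)
        curve.append(best)
    return curve


if __name__ == "__main__":
    rng = random.Random(1)
    # Triangular density on [-1, 1], scored by the density itself.
    sample = [[rng.triangular(-1.0, 1.0, 0.0)] for _ in range(2000)]
    density = lambda x: max(0.0, 1.0 - abs(x[0]))
    print(empirical_em_curve(density, sample, [(-2.0, 2.0)], [0.2, 0.5, 0.8]))
```

For the triangular density used in this usage example, a direct computation gives $EM^*(t) = (1-t)^2$ for $t \in [0,1]$, so the printed values can be compared with $0.64$, $0.25$ and $0.04$ up to sampling error.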

4 A general approach to learn a 
scoring function 

The concept of EM-curve provides a simple way to compare scoring functions but optimizing such a functional criterion is far from straightforward. As in [1], we propose to discretize the continuum of optimization problems and to construct a nearly optimal scoring function with level sets built by solving a finite collection of empirical versions of problem (2) over a subclass $\mathcal{G}$ of borelian subsets. In order to analyze the accuracy of this approach, we introduce the following additional assumptions.

A3 All minimum volume sets belong to $\mathcal{G}$:

$$\forall t > 0, \quad \Omega^*_t \in \mathcal{G}.$$
A4 The Rademacher average

$$\mathcal{R}_n = \mathbb{E}\left[\sup_{\Omega \in \mathcal{G}} \frac{1}{n} \left|\sum_{i=1}^n \epsilon_i \mathbb{1}_{X_i \in \Omega}\right|\right]$$

is of order $O_{\mathbb{P}}(n^{-1/2})$, where $(\epsilon_i)_{i \ge 1}$ is a Rademacher chaos independent of the $X_i$'s.

Assumption A4 is very general and is fulfilled in particular when $\mathcal{G}$ is of finite VC dimension, see [5], whereas the zero bias assumption A3 is in contrast very restrictive. It will be relaxed in section 5.

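Let $\delta \in (0,1)$ and consider the complexity penalty $\Phi_n(\delta) = 2\mathcal{R}_n + \sqrt{\log(1/\delta)/(2n)}$. We have for all $n \ge 1$:

$$\mathbb{P}\left\{\sup_{G \in \mathcal{G}} \left(|P(G) - P_n(G)| - \Phi_n(\delta)\right) > 0\right\} \le \delta, \tag{5}$$

see [5] for instance. Denote by $P_n = (1/n)\sum_{i=1}^n \delta_{X_i}$ the empirical measure based on the training sample $X_1, \ldots, X_n$. For $t \ge 0$, define also the signed measures:

$$H_t(\cdot) = F(\cdot) - t\,Leb(\cdot) \quad \text{and} \quad H_{n,t}(\cdot) = F_n(\cdot) - t\,Leb(\cdot).$$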

Equipped with these notations, for any $s \in \mathcal{S}$, we point out that one may write $EM^*(t) = \sup_{u \ge 0} H_t(\{x \in \mathcal{X}, f(x) \ge u\})$ and $EM_s(t) = \sup_{u \ge 0} H_t(\{x \in \mathcal{X}, s(x) \ge u\})$. Let $K > 0$ and $0 < t_K < t_{K-1} < \ldots < t_1$. For $k$ in $\{1, \ldots, K\}$, let $\hat\Omega_{t_k}$ be an empirical $t_k$-cluster, that is to say a borelian subset of $\mathcal{X}$ such that

$$\hat\Omega_{t_k} \in \arg\max_{\Omega \in \mathcal{G}} H_{n,t_k}(\Omega).$$

The empirical excess mass at level $t_k$ is then $H_{n,t_k}(\hat\Omega_{t_k})$. The following result reveals the benefit of viewing density level sets as solutions of (2) rather than solutions of (1) (corresponding to a different parametrization of the thresholds).
Proposition 3. (Monotonicity) For any $k$ in $\{1, \ldots, K\}$, the subsets $\bigcup_{i \le k} \hat\Omega_{t_i}$ and $\bigcap_{i \ge k} \hat\Omega_{t_i}$ are still empirical $t_k$-clusters, just like $\hat\Omega_{t_k}$.

The result above shows that monotone (regarding the inclusion) collections of empirical clusters can always be built. Coming back to the example depicted by Fig. 3, as $t$ decreases, the $\hat\Omega_t$'s are successively equal to $A_1$, $A_1 \cup A_3$, and $A_1 \cup A_3 \cup A_2$, and are thus monotone as expected. This way, one fully avoids the problem inherent to the prior specification of a subdivision of the mass-axis in the MV-curve minimization approach (see the discussion in section 2).
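
The monotonicity can be illustrated on the three-rectangle example of Fig. 3. The sketch below assumes that the class of candidate sets is made of all unions of cells of the partition, in which case the empirical excess mass is additive over cells and an empirical $t$-cluster is obtained by keeping every cell with positive contribution. The cell counts (10, 9, 1) are those of Fig. 3; the cell areas are hypothetical values chosen here so that the ordering matches the figure.

```python
def empirical_cluster(counts, volumes, t):
    """Union of cells maximizing H_{n,t} over unions of partition cells.

    A cell is kept iff its own contribution n_i / n - t * vol_i is positive.
    """
    n = sum(counts.values())
    return frozenset(c for c in counts if counts[c] / n - t * volumes[c] > 0)


# Cell counts taken from Figure 3 (n = 20 points).
counts = {"A1": 10, "A2": 9, "A3": 1}
# Hypothetical cell areas (not given in the paper).
volumes = {"A1": 1.0, "A2": 10.0, "A3": 0.5}

levels = [0.4, 0.08, 0.03]  # decreasing values of t
clusters = [empirical_cluster(counts, volumes, t) for t in levels]

if __name__ == "__main__":
    for t, cluster in zip(levels, clusters):
        print(t, sorted(cluster))
```

With these areas the empirical densities of the cells are 0.5, 0.045 and 0.1, so the clusters grow as $A_1$, then $A_1 \cup A_3$, then $A_1 \cup A_3 \cup A_2$ as $t$ decreases, without any correction step being needed to enforce nestedness.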


Consider an increasing sequence of empirical $t_k$ clusters $(\hat\Omega_{t_k})_{1 \le k \le K}$ and a scoring function $s \in \mathcal{S}$ of the form

$$s_K(x) := \sum_{k=1}^K a_k \mathbb{1}_{x \in \hat\Omega_{t_k}}, \tag{6}$$

where $a_k > 0$ for every $k \in \{1, \ldots, K\}$. Notice that the scoring function (6) can be seen as a Riemann sum approximation of (4) when $a_k = a(t_k) - a(t_{k+1})$. For simplicity solely, we take $a_k = t_k - t_{k+1}$ (with the convention $t_{K+1} = 0$) so that the $\hat\Omega_{t_k}$'s are $t_k$-level sets of $s_K$, i.e. $\hat\Omega_{t_k} = \{s \ge t_k\}$ and $\{s \ge t\} = \hat\Omega_{t_k}$ if $t \in ]t_{k+1}, t_k]$. Observe that the results established in this paper remain true for other choices. In the asymptotic framework considered in the subsequent analysis, it is stipulated that $K = K_n \to \infty$ as $n \to +\infty$. We assume in addition that [...].

Remark 1. (Nested sequences) For $L \le K$, we have $\{\Omega_{s_L,l}, l \ge 0\} = (\hat\Omega_{t_k})_{0 \le k \le L} \subset (\hat\Omega_{t_k})_{0 \le k \le K} = \{\Omega_{s_K,l}, l \ge 0\}$, so that by definition, $EM_{s_L} \le EM_{s_K}$.

Remark 2. (Related work) We point out that a very similar result is proved in [9] (see Lemma 2.2 therein) concerning the Lebesgue measure of the symmetric differences of density clusters.

Remark 3. (Alternative construction) It is noteworthy that, in practice, one may solve the optimization problems $\tilde\Omega_{t_k} \in \arg\max_{\Omega \in \mathcal{G}} H_{n,t_k}(\Omega)$ and next form $\hat\Omega_{t_k} = \bigcup_{i \le k} \tilde\Omega_{t_i}$.

The following theorem provides rate bounds describing the performance of the scoring function $s_K$ thus built with respect to the EM curve criterion in the case where the density $f$ has compact support.

Theorem 1. (Compact support case) Assume that conditions A1, A2, A3 and A4 hold true, and that $f$ has a compact support. Let $\delta \in ]0,1[$, let $(t_k)_{k \in \{1, \ldots, K\}}$ be such that $\sup_{1 \le k < K}(t_k - t_{k+1}) = O(1/\sqrt{n})$. Then, there exists a constant $A$ independent from the $t_k$'s, $n$ and $\delta$ such that, with probability at least $1 - \delta$, we have:

$$\sup_{t \in ]0, t_1]} |EM^*(t) - EM_{s_K}(t)| \le \left(A + \sqrt{2\log(1/\delta)} + Leb(\mathrm{supp} f)\right) \frac{1}{\sqrt{n}}.$$

Remark 4. (Localization) The problem tackled in this paper is that of scoring anomalies, which correspond to observations lying outside of "large" excess mass sets, namely density clusters with parameter $t$ close to zero. It is thus essential to establish rate bounds for the quantity $\sup_{t \in ]0, C]} |EM^*(t) - EM_{s_K}(t)|$, where $C > 0$ depends on the proportion of the "least normal" data we want to score/rank.
5 Extensions - Further results

This section is devoted to extending the results of the previous one. We first relax the compact support assumption and next the one stipulating that all density level sets belong to the class $\mathcal{G}$, namely A3.

5.1 Distributions with non compact support

It is the purpose of this section to show that the algorithm detailed below produces a scoring function $s$ such that $EM_s$ is uniformly close to $EM^*$ (Theorem 2). See Figure 3 as an illustration and a comparison with the MV formulation as used as a way to recover empirical minimum volume sets $\hat\Gamma_\alpha$.

Algorithm 1. Suppose that assumptions A1, A2, A3, A4 hold true. Let $t_1$ be such that $\max_{\Omega \in \mathcal{G}} H_{n,t_1}(\Omega) \ge 0$. Fix $N > 0$. For $k = 1, \ldots, N$,

1. Find $\tilde\Omega_{t_k} \in \arg\max_{\Omega \in \mathcal{G}} H_{n,t_k}(\Omega)$,

2. Define $\hat\Omega_{t_k} = \bigcup_{i \le k} \tilde\Omega_{t_i}$,

3. Set $t_{k+1} = t_1/(1 + 1/\sqrt{n})^{k}$ for $k \le N - 1$.

In order to reduce the complexity, we may replace steps 1 and 2 with $\hat\Omega_{t_k} \in \arg\max_{\Omega \supset \hat\Omega_{t_{k-1}}} H_{n,t_k}(\Omega)$. The resulting piecewise constant scoring function is

$$s_N(x) = \sum_{k=1}^N (t_k - t_{k+1}) \mathbb{1}_{x \in \hat\Omega_{t_k}}. \tag{7}$$

Figure 3: Sample of $n = 20$ points in a 2-d space, partitioned into three rectangles $A_1$, $A_2$, $A_3$ containing $n_1, n_2, n_3 = 10, 9, 1$ points respectively. As $\alpha$ increases, the minimum volume sets $\hat\Gamma_\alpha$ are successively equal to $A_1$, $A_1 \cup A_2$, $A_1 \cup A_3$, and $A_1 \cup A_3 \cup A_2$, whereas, in the EM-approach, as $t$ decreases, the $\hat\Omega_t$'s are successively equal to $A_1$, $A_1 \cup A_3$, and $A_1 \cup A_3 \cup A_2$.

The main argument to extend the above results to the case where $\mathrm{supp} f$ is not bounded is given in Lemma 2 in the "Technical Details" section. The meshgrid $(t_k)$ must be chosen adaptively, in a data-driven fashion. Let $h : \mathbb{R}^*_+ \to \mathbb{R}_+$ be a decreasing function such that $\lim_{t \to 0} h(t) = +\infty$. Just like the previous approach, the grid is described by a decreasing sequence $(t_k)$. Let $t_1 \ge 0$, $N > 0$ and define recursively $t_1 > t_2 > \ldots > t_N > t_{N+1} = 0$, as well as $\hat\Omega_{t_1}, \ldots, \hat\Omega_{t_N}$, through

$$t_{k+1} = t_k - \frac{1}{\sqrt{n}} \frac{1}{h(t_{k+1})}, \tag{8}$$

$$\hat\Omega_{t_k} = \arg\max_{\Omega \in \mathcal{G}} H_{n,t_k}(\Omega), \tag{9}$$

with the property that $\hat\Omega_{t_{k+1}} \supset \hat\Omega_{t_k}$. As pointed out in Remark 3, it suffices to take $\hat\Omega_{t_{k+1}} = \tilde\Omega_{t_{k+1}} \cup \hat\Omega_{t_k}$, where $\tilde\Omega_{t_{k+1}} = \arg\max_{\Omega \in \mathcal{G}} H_{n,t_{k+1}}(\Omega)$. This yields the scoring function $s_N$ defined by (7) such that by virtue of Lemma 2 (see the Technical Details), with probability at least $1 - \delta$,

$$\sup_{t \in ]t_N, t_1]} |EM^*(t) - EM_{s_N}(t)| \le \left(A + \sqrt{2\log(1/\delta)} + \sup_{1 \le k \le N} \frac{\lambda(t_k)}{h(t_k)}\right) \frac{1}{\sqrt{n}}.$$

Therefore, if we take $h$ such that $\lambda(t) = O(h(t))$ as $t \to 0$, we can assume that $\lambda(t)/h(t) \le B$ for $t$ in $]0, t_1]$ since $\lambda$ is decreasing, and we obtain:

$$\sup_{t \in ]t_N, t_1]} |EM^*(t) - EM_{s_N}(t)| \le \left(A + \sqrt{2\log(1/\delta)}\right) \frac{1}{\sqrt{n}}. \tag{10}$$

On the other hand, from $t\,Leb(\{f > t\}) \le \int_{f > t} f \le 1$, we have $\lambda(t) \le 1/t$. Thus $h$ can be chosen as $h(t) := 1/t$ for $t \in ]0, t_1]$. In this case, (8) yields, for $k \ge 2$,

$$t_k = \frac{t_1}{(1 + 1/\sqrt{n})^{k-1}}. \tag{11}$$

Theorem 2. (Unbounded support case) Suppose that assumptions A1, A2, A3, A4 hold true, let $t_1 > 0$ and for $k \ge 2$, consider $t_k$ as defined by (11), $\hat\Omega_{t_k}$ by (9), and $s_N$ by (7). Then there is a constant $A$ independent from $N$, $n$ and $\delta$ such that, with probability larger than $1 - \delta$, we have:

$$\sup_{t \in ]0, t_1]} |EM^*(t) - EM_{s_N}(t)| \le \left(A + \sqrt{2\log(1/\delta)}\right) \frac{1}{\sqrt{n}} + o_N(1),$$

where $o_N(1) = 1 - EM^*(t_N)$. In addition, $s_N(x)$ converges to $s_\infty(x) := \sum_{k=1}^\infty (t_k - t_{k+1}) \mathbb{1}_{x \in \hat\Omega_{t_k}}$ as $N \to \infty$ and $s_\infty$ is such that, for all $\delta \in (0,1)$, we have with probability at least $1 - \delta$:

$$\sup_{t \in ]0, t_1]} |EM^*(t) - EM_{s_\infty}(t)| \le \left(A + \sqrt{2\log(1/\delta)}\right) \frac{1}{\sqrt{n}}.$$
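
The construction of this subsection can be sketched in a few lines when the class of candidate sets is made of unions of hypercubes of a fixed side length, as in the simulation study. This is an illustrative sketch under that assumption, not the authors' code: the names `geometric_levels` and `fit_scoring_function` are hypothetical, and the levels follow the geometric grid $t_k = t_1/(1 + 1/\sqrt{n})^{k-1}$ with $t_{N+1} = 0$. With such a class, a cell belongs to the empirical $t_k$-cluster exactly when its empirical density exceeds $t_k$, so the clusters are nested automatically.

```python
import math
from collections import Counter


def geometric_levels(t1, n, N):
    """Levels t_k = t1 / (1 + 1/sqrt(n))**(k-1), k = 1..N, followed by t_{N+1} = 0."""
    ratio = 1.0 + 1.0 / math.sqrt(n)
    return [t1 / ratio ** k for k in range(N)] + [0.0]


def fit_scoring_function(sample, side, t1, N):
    """Piecewise constant score s_N(x) = sum_k (t_k - t_{k+1}) 1{x in cluster_k}."""
    n = len(sample)
    dim = len(sample[0])
    cell_volume = side ** dim

    def cell_of(x):
        return tuple(math.floor(c / side) for c in x)

    counts = Counter(cell_of(x) for x in sample)
    density = {cell: cnt / (n * cell_volume) for cell, cnt in counts.items()}
    levels = geometric_levels(t1, n, N)

    # A cell lies in the empirical t_k-cluster iff its empirical density exceeds t_k.
    cell_score = {
        cell: sum(levels[k] - levels[k + 1] for k in range(N) if d > levels[k])
        for cell, d in density.items()
    }

    def score(x):
        # Cells without observations are never selected, hence score 0.
        return cell_score.get(cell_of(x), 0.0)

    return score, levels


if __name__ == "__main__":
    sample = [[0.1 * i] for i in range(1, 7)] + [[1.2], [1.5], [1.8]] + [[5.5]]
    score, levels = fit_scoring_function(sample, side=1.0, t1=1.0, N=200)
    print(score([0.5]), score([1.5]), score([5.5]), score([3.0]))
```

Because the sum telescopes, the score of a cell is the largest grid level lying strictly below its empirical density, so the score ranks cells in the same order as the histogram density estimate and approximates it up to the factor $1 + 1/\sqrt{n}$.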
5.2 Bias analysis

In this subsection, we relax assumption A3. For any collection $\mathcal{C}$ of subsets of $\mathbb{R}^d$, $\sigma(\mathcal{C})$ denotes here the $\sigma$-algebra generated by $\mathcal{C}$. Consider the hypothesis below.

A3' There exists a countable subcollection of $\mathcal{G}$, $F = \{F_i\}_{i \ge 1}$ say, forming a partition of $\mathcal{X}$ and such that $\sigma(F) \subset \mathcal{G}$.

Denote by $f_F$ the best approximation (for the $L_1$-norm) of $f$ by piecewise constant functions on $F$,

$$f_F(x) := \sum_{i \ge 1} \mathbb{1}_{x \in F_i} \frac{1}{Leb(F_i)} \int_{F_i} f(y)\,dy.$$

Then, variants of Theorems 1 and 2 can be established without assumption A3, as soon as A3' holds true, at the price of the additional term $\|f - f_F\|_{L_1}$ in the bound, related to the inherent bias. For illustration purposes, the following result generalizes one of the inequalities stated in Theorem 2:

Theorem 3. (Biased empirical clusters) Suppose that assumptions A1, A2, A3', A4 hold true, let $t_1 > 0$ and for $k \ge 2$ consider $t_k$ defined by (11), $\hat\Omega_{t_k}$ by (9), and $s_N$ by (7). Then there is a constant $A$ independent from $N$, $n$, $\delta$ such that, with probability larger than $1 - \delta$, we have:

$$\sup_{t \in ]0, t_1]} |EM^*(t) - EM_{s_N}(t)| \le \left(A + \sqrt{2\log(1/\delta)}\right) \frac{1}{\sqrt{n}} + \|f - f_F\|_{L_1} + o_N(1),$$

where $o_N(1) = 1 - EM^*(t_N)$.

Remark 5. (Hypercubes) In practice, one defines a sequence of models $F_l \subset \mathcal{G}_l$ indexed by a tuning parameter $l$ controlling (the inverse of) model complexity, such that $\|f - f_{F_l}\|_{L_1} \to 0$ as $l \to 0$. For instance, the class $F_l$ could be formed by disjoint hypercubes of side length $l$.
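
The decay of the bias term with the side length can be checked numerically on a simple case. The sketch below uses the one-dimensional Laplace density $f(x) = e^{-|x|}/2$ as an illustrative choice (an assumption made here, it is not the density of the simulation study), computes the cell averages defining $f_{F_l}$ from the closed-form cdf, and approximates $\|f - f_{F_l}\|_{L_1}$ by a midpoint rule on a truncated range.

```python
import math


def laplace_pdf(x):
    return 0.5 * math.exp(-abs(x))


def laplace_cdf(x):
    return 0.5 * math.exp(x) if x < 0 else 1.0 - 0.5 * math.exp(-x)


def l1_bias(side, half_width=20.0, sub=200):
    """Midpoint-rule approximation of ||f - f_F||_{L1} on [-half_width, half_width].

    f_F is constant on each cell [a, a + side) and equal to the mean of f there.
    """
    n_cells = int(round(2.0 * half_width / side))
    step = side / sub
    total = 0.0
    for i in range(n_cells):
        a = -half_width + i * side
        cell_mean = (laplace_cdf(a + side) - laplace_cdf(a)) / side
        total += step * sum(
            abs(laplace_pdf(a + (j + 0.5) * step) - cell_mean) for j in range(sub)
        )
    return total


if __name__ == "__main__":
    for side in (1.0, 0.5, 0.25):
        print(side, l1_bias(side))
```

The printed bias should shrink roughly in proportion to the side length, in line with a bias of order $l$ for a Lipschitz density.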


6 Simulation examples

Algorithm 1 is here implemented from simulated 2-d heavy-tailed data with common density $f(x, y) = 1/2 \times 1/(1 + |x|)^3 \times 1/(1 + |y|)^2$. The training set is of size $n = 10^5$, whereas the test set counts $10^6$ points. For $l > 0$, we set $\mathcal{G}_l = \sigma(F_l)$ where $F_l = \{F_i^l\}_{i \in \mathbb{Z}^2}$ and $F_i^l = [l i_1, l i_1 + l] \times [l i_2, l i_2 + l]$ for all $i = (i_1, i_2) \in \mathbb{Z}^2$. The bias of the model is thus bounded by $\|f - f_F\|_\infty$, vanishing as $l \to 0$ (observe that the bias is at most of order $l$ as soon as $f$ is Lipschitz for instance). The scoring function $s$ is built using the points located in $[-L, L]^2$ and setting $s = 0$ outside of $[-L, L]^2$. Practically, one takes $L$ as the maximum norm value of the points in the training set, or such that an empirical estimate of $\mathbb{P}(X \in [-L, L]^2)$ is very close to 1 (here one obtains 0.998 for $L = 500$). The implementation of our algorithm involves the use of a sparse matrix to store the data in the partition of hypercubes, such that the complexity of the procedure for building the scoring function $s$ and that of the computation of its empirical EM-curve is very small compared to that needed to compute $f_{F_l}$ and $EM_{f_{F_l}}$, which are given here for the sole purpose of quantifying the model bias.

Fig. 4 illustrates as expected the deterioration of $EM_s$ for large $l$, except for $t$ close to zero: this corresponds to the model bias. However, Fig. 5 reveals an "overfitting" phenomenon for values of $t$ close to zero, when $l$ is fairly small. This is mainly due to the fact that subsets involved in the scoring function are then tiny in regions where there are very few observations (in the tail of the distribution). On the other hand, for the largest values of $t$, the smallest values of $l$ give the best results: the smaller the parameter $l$, the weaker the model bias, and no overfitting is experienced because of the high local density of the observations. Recalling the notation $EM_{\mathcal{G}}(t) = \max_{\Omega \in \mathcal{G}} H_t(\Omega) \le EM^*(t) = \max_{\Omega \text{ meas.}} H_t(\Omega)$, so that the bias of our model is $EM^* - EM_{\mathcal{G}}$, Fig. 6 illustrates the variations of the bias with the wealth of our model, characterized by $l$, the width of the partition by hypercubes. Notice that partitions with large $l$ are not such good approximations for large $t$, but perform as well as the others in the extreme values, namely when $t$ is close to 0. On top of that, those partitions have the merit not to overfit the extreme data, which typically are isolated.

This empirical analysis demonstrates that introducing a notion of adaptivity for the partition $F$, with progressively growing bin-width as $t$ decays to zero and as the hypercubes are being selected in the construction of $s$ (which crucially depends on local properties of the empirical distribution), drastically improves the accuracy of the resulting scoring function in the EM curve sense.


7 Conclusion 

Extending the contribution of [1], this article provides an alternative view (respectively, another parameterization) of the anomaly scoring problem, leading to another adaptive method to build scoring functions, which offers theoretical and computational advantages both at the same time. This novel formulation yields a procedure producing a nested sequence of empirical density level sets, and exhibits good performance, even in the non-compact support case. In addition, the model bias has been incorporated in the rate bound analysis.
Figure 4: Optimal and realized EM curves

Figure 5: Zoom near 0

Figure 6: $EM_{\mathcal{G}}$ for different $l$

Technical Details

Proof of Theorem 1 (Sketch of) The proof results from the following lemma, which does not use the compact support assumption on $f$ and is the starting point of the extension to the non compact support case (section 5.1).

Lemma 2. Suppose that assumptions A1, A2, A3 and A4 are fulfilled. Then, for $1 \le k \le K - 1$, there exists a constant $A$ independent from $n$ and $\delta$, such that, with probability at least $1 - \delta$, for $t$ in $]t_{k+1}, t_k]$,

$$|EM^*(t) - EM_{s_K}(t)| \le \left(A + \sqrt{2\log(1/\delta)}\right) \frac{1}{\sqrt{n}} + \lambda(t_{k+1})(t_k - t_{k+1}).$$

The detailed proof of this lemma is in the supplementary material, and is a combination of the two following results, the second one being a straightforward consequence of the derivative property of $EM^*$ (Proposition 1):

• With probability at least $1 - \delta$, for $k \in \{1, \ldots, K\}$,

$$0 \le EM^*(t_k) - EM_{s_K}(t_k) \le 2\Phi_n(\delta).$$

• Let $k$ in $\{1, \ldots, K - 1\}$. Then for every $t$ in $]t_{k+1}, t_k]$,

$$0 \le EM^*(t) - EM^*(t_k) \le \lambda(t_{k+1})(t_k - t_{k+1}).$$

Proof of Theorem 2 (Sketch of) The first assertion is a consequence of (10) combined with the fact that

$$\sup_{t \in ]0, t_N]} |EM^*(t) - EM_{s_N}(t)| \le 1 - EM_{s_N}(t_N) \le 1 - EM^*(t_N) + 2\Phi_n(\delta)$$

holds true with probability at least $1 - \delta$. For the second part, it suffices to observe that $s_N(x)$ (absolutely) converges to $s_\infty$ and that, as pointed out in Remark 1, $EM_{s_N} \le EM_{s_\infty}$.

Proof of Theorem 3 (Sketch of) The result directly follows from the following lemma, which establishes an upper bound for the bias, with the notations $EM^*_{\mathcal{C}}(t) := \max_{\Omega \in \mathcal{C}} H_t(\Omega) \le EM^*(t) = \max_{\Omega\ \text{meas.}} H_t(\Omega)$ for any class of measurable sets $\mathcal{C}$, and $\mathcal{F} := \sigma(F)$ so that by assumption $A_3$, $\mathcal{F} \subset \mathcal{G}$. Details are omitted due to space limits.

Lemma 3. Under assumption $A_3$, we have for every $t$ in $[0, \|f\|_\infty]$,

$$0 \le EM^*(t) - EM^*_{\mathcal{G}}(t) \le \|f - f_F\|_{L^1}.$$

The model bias $EM^* - EM^*_{\mathcal{G}}$ is then uniformly bounded by $\|f - f_F\|_{L^1}$.

To prove this lemma (see the supplementary material for details), one shows that:

$$EM^*(t) - EM^*_{\mathcal{G}}(t) \le \int_{f > t} (f - f_F) + \int_{\{f > t\} \setminus \{f_F > t\}} (f_F - t) - \int_{\{f_F > t\} \setminus \{f > t\}} (f_F - t),$$

where we use the fact that for all $t > 0$, $\{f_F > t\} \in \mathcal{F}$ and $\forall G \in \mathcal{F}$, $\int_G f = \int_G f_F$. It suffices then to observe that the second and the third term in the bound are non-positive.
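The bias bound of Lemma 3 can be illustrated on a case where every quantity has a closed form. The sketch below assumes $f(x) = 2x$ on $[0, 1]$ and takes $F$ to be the partition of $[0, 1]$ into $m$ equal cells, with the model class reduced to $\sigma(F)$ (unions of cells). Under these assumptions $\|f - f_F\|_{L^1} = 1/(2m)$. The density, the partition and the function names are hypothetical choices made for the example.

```python
# Toy check of 0 <= EM*(t) - EM*_F(t) <= ||f - f_F||_{L1}
# for f(x) = 2x on [0, 1] and a partition into m equal cells (assumptions).

def em_star(t):
    """Optimal excess mass for f(x) = 2x: (1 - t/2)^2 on [0, 2]."""
    return max(0.0, 1.0 - t / 2.0) ** 2

def em_partition(t, m):
    """Max of H_t over unions of cells: keep each cell with positive excess mass."""
    h = 1.0 / m
    total = 0.0
    for i in range(m):
        mass = (2 * i + 1) * h * h      # integral of 2x over [i h, (i + 1) h]
        total += max(0.0, mass - t * h)
    return total

def l1_bias(m):
    """||f - f_F||_{L1}: each cell contributes h^2 / 2, hence 1 / (2m) in total."""
    return 1.0 / (2 * m)

def check_lemma3(m, n_levels=200, tol=1e-12):
    for j in range(n_levels + 1):
        t = 2.0 * j / n_levels          # t ranges over [0, ||f||_inf] = [0, 2]
        gap = em_star(t) - em_partition(t, m)
        if not (-tol <= gap <= l1_bias(m) + tol):
            return False
    return True

print([check_lemma3(m) for m in (1, 2, 5, 10)])
```

Refining the partition shrinks the $L^1$ distance between $f$ and its cell averages, and the excess-mass gap shrinks with it.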
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1 Illustrations 

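Note that the scoring function we built in Algorithm ?? is an estimator of the density $f$ (usually called the silhouette), since $f(x) = \int_0^\infty \mathbb{1}_{f(x) > t}\,dt = \int_0^\infty \mathbb{1}_{x \in \Omega^*_t}\,dt$ and $s(x) := \sum_{k=1}^{K} (t_k - t_{k+1})\,\mathbb{1}_{x \in \hat\Omega_{t_k}}$, which is a discretization of $\int_0^\infty \mathbb{1}_{x \in \hat\Omega_t}\,dt$. This fact is illustrated in Fig. 1.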



Figure 1: density and scoring functions 
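The link between the density and a level-set based score can be sketched numerically. The example below replaces the estimated sets by the true level sets $\{f > t_k\}$ of a standard Gaussian density, which is an assumption made purely for illustration, and uses a uniform grid of levels with $t_{K+1} = 0$. With these choices the discretized sum should stay within one grid step of $f$.

```python
import math

def gaussian_density(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def silhouette(x, levels, density=gaussian_density):
    """s(x) = sum_k (t_k - t_{k+1}) 1{f(x) > t_k}, with t_{K+1} = 0."""
    padded = list(levels) + [0.0]
    fx = density(x)
    return sum(t_k - t_next
               for t_k, t_next in zip(padded, padded[1:])
               if fx > t_k)

def worst_gap(levels, xs, density=gaussian_density):
    """Smallest and largest value of f(x) - s(x) over the points xs."""
    gaps = [density(x) - silhouette(x, levels, density) for x in xs]
    return min(gaps), max(gaps)

K = 100
t_max = gaussian_density(0.0)
levels = [t_max * (K - k) / K for k in range(K)]   # t_1 > ... > t_K > 0
xs = [-4.0 + 8.0 * i / 400 for i in range(401)]
low, high = worst_gap(levels, xs)
print(low, high, t_max / K)
```

Because the level sets are nested, the sum telescopes to the largest level lying strictly below $f(x)$, which explains the one-step accuracy.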


2 Detailed Proofs 

Proof of Proposition ?? 

Let $t > 0$. Recall that $EM^*(t) = \alpha(t) - t\lambda(t)$ where $\alpha(t)$ denotes the mass at level $t$, namely $\alpha(t) = \mathbb{P}(f(X) > t)$, and $\lambda(t)$ denotes the volume at level $t$, i.e. $\lambda(t) = \mathrm{Leb}(\{x, f(x) > t\})$. For $h > 0$, let $A(h)$ denote the quantity $A(h) = \frac{1}{h}(\alpha(t+h) - \alpha(t))$ and $B(h) = \frac{1}{h}(\lambda(t+h) - \lambda(t))$. It is straightforward to see that $A(h)$ and $B(h)$ converge when $h \to 0$, and expressing $EM^{*\prime}(t) = \alpha'(t) - t\lambda'(t) - \lambda(t)$, it suffices to show that $\alpha'(t) - t\lambda'(t) = 0$, namely $\lim_{h \to 0} A(h) - tB(h) = 0$. Now we have

$$|A(h) - tB(h)| = \frac{1}{h}\int_{t < f \le t+h} (f - t) \le \frac{1}{h}\int_{t < f \le t+h} h = \mathrm{Leb}(t < f \le t+h) \to 0$$

because $f$ has no flat part.
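The derivative property $EM^{*\prime}(t) = -\lambda(t)$ can be checked by finite differences on a density with explicit level sets. The sketch below assumes the standard Gaussian density, for which $\{f > t\} = (-a(t), a(t))$ with $a(t) = \sqrt{-2\log(t\sqrt{2\pi})}$ for $0 < t < 1/\sqrt{2\pi}$. The choice of density, the step size and the function names are illustrative.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)

def half_width(t):
    """a(t) such that {f > t} = (-a, a) for the standard normal density f."""
    return math.sqrt(-2.0 * math.log(t * SQRT_2PI))

def volume(t):
    """lambda(t) = Leb({f > t})."""
    return 2.0 * half_width(t)

def em_star(t):
    """EM*(t) = alpha(t) - t * lambda(t), with alpha(t) = P(|X| < a(t))."""
    mass = math.erf(half_width(t) / math.sqrt(2.0))
    return mass - t * volume(t)

def derivative_gap(t, h=1e-6):
    """|central finite difference of EM* at t  +  lambda(t)|, expected near 0."""
    fd = (em_star(t + h) - em_star(t - h)) / (2.0 * h)
    return abs(fd + volume(t))

print([derivative_gap(t) for t in (0.05, 0.1, 0.2, 0.3)])
```

The terms $\alpha'(t)$ and $t\lambda'(t)$ cancel exactly here, as in the proof, so only $-\lambda(t)$ remains.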


Proof of Lemma ??: 


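On the one hand, for every $\Omega$ measurable,

$$\begin{aligned}
\mathbb{P}(X \in \Omega) - t\,\mathrm{Leb}(\Omega) &= \int_{\Omega} (f(x) - t)\,dx \\
&\le \int_{\Omega \cap \{f > t\}} (f(x) - t)\,dx \\
&\le \int_{\{f > t\}} (f(x) - t)\,dx \\
&= \mathbb{P}(f(X) > t) - t\,\mathrm{Leb}(\{f > t\}).
\end{aligned}$$

It follows that $\{f > t\} \in \arg\max_{A\ \text{meas.}} \mathbb{P}(X \in A) - t\,\mathrm{Leb}(A)$.

On the other hand, suppose $\Omega \in \arg\max_{A\ \text{meas.}} \mathbb{P}(X \in A) - t\,\mathrm{Leb}(A)$ and $\mathrm{Leb}(\{f > t\} \setminus \Omega) > 0$. Then there is $\epsilon > 0$ such that $\mathrm{Leb}(\{f > t + \epsilon\} \setminus \Omega) > 0$ (by sub-additivity of $\mathrm{Leb}$, if it is not the case, then $\mathrm{Leb}(\{f > t\} \setminus \Omega) = \mathrm{Leb}(\cup_{\epsilon \in \mathbb{Q}_+} \{f > t + \epsilon\} \setminus \Omega) = 0$). We have thus

$$\int_{\{f > t\} \setminus \Omega} (f(x) - t)\,dx > \epsilon\,\mathrm{Leb}(\{f > t + \epsilon\} \setminus \Omega) > 0,$$

so that

$$\int_{\Omega} (f(x) - t)\,dx \le \int_{\{f > t\}} (f(x) - t)\,dx - \int_{\{f > t\} \setminus \Omega} (f(x) - t)\,dx < \int_{\{f > t\}} (f(x) - t)\,dx,$$

i.e.

$$\mathbb{P}(X \in \Omega) - t\,\mathrm{Leb}(\Omega) < \mathbb{P}(f(X) > t) - t\,\mathrm{Leb}(\{x, f(x) > t\}),$$

which is a contradiction: $\{f > t\} \subset \Omega$ Leb-a.s.

To show that $\Omega^*_t \subset \{x, f(x) \ge t\}$, suppose that $\mathrm{Leb}(\Omega^*_t \cap \{f < t\}) > 0$. Then by sub-additivity of $\mathrm{Leb}$ just as above, there is $\epsilon > 0$ s.t. $\mathrm{Leb}(\Omega^*_t \cap \{f < t - \epsilon\}) > 0$ and $\int_{\Omega^*_t \cap \{f < t - \epsilon\}} (f - t) \le -\epsilon\,\mathrm{Leb}(\Omega^*_t \cap \{f < t - \epsilon\}) < 0$. It follows that $\mathbb{P}(X \in \Omega^*_t) - t\,\mathrm{Leb}(\Omega^*_t) < \mathbb{P}(X \in \Omega^*_t \setminus \{f < t - \epsilon\}) - t\,\mathrm{Leb}(\Omega^*_t \setminus \{f < t - \epsilon\})$, which is a contradiction with the optimality of $\Omega^*_t$.


Proof of Proposition ??

Proving the first assertion is immediate, since $\int_{f > t} (f(x) - t)\,dx \ge \int_{s > u} (f(x) - t)\,dx$. Let us now turn to the second assertion. We have:

$$\begin{aligned}
EM^*(t) - EM_s(t) &= \int_{f > t} (f(x) - t)\,dx - \sup_{u > 0} \int_{s > u} (f(x) - t)\,dx \\
&= \inf_{u > 0} \left[ \int_{f > t} (f(x) - t)\,dx - \int_{s > u} (f(x) - t)\,dx \right],
\end{aligned}$$

yet:

$$\int_{\{f > t\} \setminus \{s > u\}} (f(x) - t)\,dx + \int_{\{s > u\} \setminus \{f > t\}} (t - f(x))\,dx \le (\|f\|_\infty - t)\,\mathrm{Leb}(\{f > t\} \setminus \{s > u\}) + t\,\mathrm{Leb}(\{s > u\} \setminus \{f > t\}),$$

so we obtain, for every $u > 0$:

$$EM^*(t) - EM_s(t) \le \max(t, \|f\|_\infty - t)\,\mathrm{Leb}(\{s > u\} \Delta \{f > t\}) \le \|f\|_\infty\,\mathrm{Leb}(\{s > u\} \Delta \{f > t\}).$$

To prove the third point, note that:

$$\inf_{u > 0} \mathrm{Leb}(\{s > u\} \Delta \{f > t\}) = \inf_{T \in \mathcal{T}} \mathrm{Leb}(\{T \circ s > t\} \Delta \{f > t\}).$$

Yet,

$$\begin{aligned}
\mathrm{Leb}(\{T \circ s > t\} \Delta \{f > t\}) &\le \mathrm{Leb}(\{f > t - \|T \circ s - f\|_\infty\} \setminus \{f > t + \|T \circ s - f\|_\infty\}) \\
&= \lambda(t - \|T \circ s - f\|_\infty) - \lambda(t + \|T \circ s - f\|_\infty) \\
&= -\int_{t - \|T \circ s - f\|_\infty}^{t + \|T \circ s - f\|_\infty} \lambda'(u)\,du.
\end{aligned}$$

On the other hand, we have $\lambda(t) = \int_{\mathbb{R}^d} \mathbb{1}_{f(x) \ge t}\,dx = \int_{\mathbb{R}^d} g(x) \|\nabla f(x)\|\,dx$ where we let $g(x) = \frac{1}{\|\nabla f(x)\|} \mathbb{1}_{\{x, \|\nabla f(x)\| > 0, f(x) \ge t\}}$. The co-area formula (see [1], p.249, th3.2.12) gives in this case: $\lambda(t) = \int_{\mathbb{R}} du \int_{f^{-1}(u)} \frac{1}{\|\nabla f(x)\|} \mathbb{1}_{\{x, f(x) \ge t\}}\,d\mu(x) = \int_t^\infty du \int_{f^{-1}(u)} \frac{1}{\|\nabla f(x)\|}\,d\mu(x)$ so that $\lambda'(t) = -\int_{f^{-1}(t)} \frac{1}{\|\nabla f(x)\|}\,d\mu(x)$.

Let $\eta_\epsilon$ such that $\forall u > \epsilon$, $|\lambda'(u)| = \int_{f^{-1}(u)} \frac{1}{\|\nabla f(x)\|}\,d\mu(x) < \eta_\epsilon$. We obtain:

$$\sup_{t \in [\epsilon + \inf_{T \in \mathcal{T}} \|f - T \circ s\|_\infty,\ \|f\|_\infty]} EM^*(t) - EM_s(t) \le 2\,\eta_\epsilon\,\|f\|_\infty \inf_{T \in \mathcal{T}} \|f - T \circ s\|_\infty.$$

In particular, if $\inf_{T \in \mathcal{T}} \|f - T \circ s\|_\infty \le \epsilon_1$,

$$\sup_{t \in [\epsilon + \epsilon_1,\ \|f\|_\infty]} |EM^*(t) - EM_s(t)| \le 2\,\eta_\epsilon\,\|f\|_\infty \inf_{T \in \mathcal{T}} \|f - T \circ s\|_\infty.$$

Proof of Proposition ??

Let $i$ in $\{1, \dots, K-1\}$. First, note that:

$$\begin{aligned}
H_{n,t_{i+1}}(\hat\Omega_{t_{i+1}} \cup \hat\Omega_{t_i}) &= H_{n,t_{i+1}}(\hat\Omega_{t_{i+1}}) + H_{n,t_{i+1}}(\hat\Omega_{t_i} \setminus \hat\Omega_{t_{i+1}}), \\
H_{n,t_i}(\hat\Omega_{t_{i+1}} \cap \hat\Omega_{t_i}) &= H_{n,t_i}(\hat\Omega_{t_i}) - H_{n,t_i}(\hat\Omega_{t_i} \setminus \hat\Omega_{t_{i+1}}).
\end{aligned}$$

It follows that

$$\begin{aligned}
H_{n,t_{i+1}}(\hat\Omega_{t_{i+1}} \cup \hat\Omega_{t_i}) + H_{n,t_i}(\hat\Omega_{t_{i+1}} \cap \hat\Omega_{t_i}) = {} & H_{n,t_{i+1}}(\hat\Omega_{t_{i+1}}) + H_{n,t_i}(\hat\Omega_{t_i}) \\
& + H_{n,t_{i+1}}(\hat\Omega_{t_i} \setminus \hat\Omega_{t_{i+1}}) - H_{n,t_i}(\hat\Omega_{t_i} \setminus \hat\Omega_{t_{i+1}}),
\end{aligned}$$

with $H_{n,t_{i+1}}(\hat\Omega_{t_i} \setminus \hat\Omega_{t_{i+1}}) - H_{n,t_i}(\hat\Omega_{t_i} \setminus \hat\Omega_{t_{i+1}}) \ge 0$ since $H_{n,t}$ is decreasing in $t$. But on the other hand, by definition of $\hat\Omega_{t_{i+1}}$ and $\hat\Omega_{t_i}$ we have:

$$H_{n,t_{i+1}}(\hat\Omega_{t_{i+1}} \cup \hat\Omega_{t_i}) \le H_{n,t_{i+1}}(\hat\Omega_{t_{i+1}}), \qquad H_{n,t_i}(\hat\Omega_{t_{i+1}} \cap \hat\Omega_{t_i}) \le H_{n,t_i}(\hat\Omega_{t_i}).$$

Finally we get:

$$H_{n,t_{i+1}}(\hat\Omega_{t_{i+1}} \cup \hat\Omega_{t_i}) = H_{n,t_{i+1}}(\hat\Omega_{t_{i+1}}), \qquad H_{n,t_i}(\hat\Omega_{t_{i+1}} \cap \hat\Omega_{t_i}) = H_{n,t_i}(\hat\Omega_{t_i}).$$

Proceeding by induction we have, for every $m$ such that $i + m \le K$:

$$\begin{aligned}
H_{n,t_{i+m}}(\hat\Omega_{t_i} \cup \hat\Omega_{t_{i+1}} \cup \dots \cup \hat\Omega_{t_{i+m}}) &= H_{n,t_{i+m}}(\hat\Omega_{t_{i+m}}), \\
H_{n,t_i}(\hat\Omega_{t_i} \cap \hat\Omega_{t_{i+1}} \cap \dots \cap \hat\Omega_{t_{i+m}}) &= H_{n,t_i}(\hat\Omega_{t_i}).
\end{aligned}$$

Taking $(i = 1, m = k-1)$ for the first equation and $(i = k, m = K-k)$ for the second completes the proof.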
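The nestedness of empirical excess-mass maximisers discussed in the last proof can be visualised on a very small example. The sketch below assumes that the class of candidate sets consists of unions of equal-volume histogram cells, which is a simplifying assumption for illustration. In that case the maximiser of $H_{n,t}$ keeps exactly the cells whose empirical mass exceeds $t$ times the cell volume. The counts, levels and function names are made up for the example.

```python
# Empirical excess-mass maximisers over unions of equal-volume cells
# (a hypothetical model class), and the resulting piecewise-constant score.

def empirical_level_cells(counts, n, cell_volume, t):
    """Cells kept by the maximiser of H_{n,t}: empirical mass > t * volume."""
    return {i for i, c in enumerate(counts) if c / n > t * cell_volume}

def scoring_function(counts, cell_volume, levels):
    """Per-cell score sum_k (t_k - t_{k+1}) 1{cell in set at level t_k}, t_{K+1} = 0."""
    n = sum(counts)
    padded = list(levels) + [0.0]
    sets = [empirical_level_cells(counts, n, cell_volume, t) for t in levels]
    scores = [
        sum(t_k - t_next
            for t_k, t_next, cells in zip(padded, padded[1:], sets)
            if i in cells)
        for i in range(len(counts))
    ]
    return sets, scores

counts = [1, 4, 12, 25, 30, 18, 7, 3]      # made-up cell counts, n = 100
levels = [2.0, 1.5, 1.0, 0.5, 0.2]          # decreasing levels t_1 > ... > t_K
sets, scores = scoring_function(counts, cell_volume=0.125, levels=levels)
nested = all(a <= b for a, b in zip(sets, sets[1:]))
print(nested, scores)
```

With this class the maximisers grow as the level decreases, so cells with more observations receive a score at least as large as cells with fewer.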

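Proof of Theorem ??

We shall use the following lemma:

Lemma 2.1. With probability at least $1 - \delta$, for $k \in \{1, \dots, K\}$, $0 \le EM^*(t_k) - EM_{s_K}(t_k) \le 2\Phi_n(\delta)$.

Proof of Lemma 2.1:

Remember that by definition of $\hat\Omega_{t_k}$: $H_{n,t_k}(\hat\Omega_{t_k}) = \max_{\Omega \in \mathcal{G}} H_{n,t_k}(\Omega)$ and note that:

$$EM^*(t_k) = \max_{\Omega\ \text{meas.}} H_{t_k}(\Omega) = \max_{\Omega \in \mathcal{G}} H_{t_k}(\Omega) \ge H_{t_k}(\hat\Omega_{t_k}).$$

On the other hand, using (??), with probability at least $1 - \delta$, for every $G \in \mathcal{G}$, $|\mathbb{P}(G) - \mathbb{P}_n(G)| \le \Phi_n(\delta)$. Hence, with probability at least $1 - \delta$, for all $\Omega \in \mathcal{G}$:

$$H_{n,t_k}(\Omega) - \Phi_n(\delta) \le H_{t_k}(\Omega) \le H_{n,t_k}(\Omega) + \Phi_n(\delta),$$

so that, with probability at least $1 - \delta$, for $k \in \{1, \dots, K\}$,

$$H_{n,t_k}(\hat\Omega_{t_k}) - \Phi_n(\delta) \le H_{t_k}(\hat\Omega_{t_k}) \le EM^*(t_k) \le H_{n,t_k}(\hat\Omega_{t_k}) + \Phi_n(\delta),$$

whereby, with probability at least $1 - \delta$, for $k \in \{1, \dots, K\}$,

$$0 \le EM^*(t_k) - H_{t_k}(\hat\Omega_{t_k}) \le 2\Phi_n(\delta).$$

The following lemma is a consequence of the derivative property of $EM^*$ (Proposition ??).

Lemma 2.2. Let $k$ in $\{1, \dots, K-1\}$. Then for every $t$ in $]t_{k+1}, t_k]$, $0 \le EM^*(t) - EM^*(t_k) \le \lambda(t_{k+1})(t_k - t_{k+1})$.

Combined with Lemma 2.1 and the fact that $EM_{s_K}$ is non-increasing, and writing

$$EM^*(t) - EM_{s_K}(t) = (EM^*(t) - EM^*(t_k)) + (EM^*(t_k) - EM_{s_K}(t_k)) + (EM_{s_K}(t_k) - EM_{s_K}(t)),$$

this result leads to:

$$\forall k \in \{1, \dots, K-1\},\ \forall t \in ]t_{k+1}, t_k], \quad 0 \le EM^*(t) - EM_{s_K}(t) \le 2\Phi_n(\delta) + \lambda(t_{k+1})(t_k - t_{k+1}),$$

which gives Lemma ?? stated in section Technical Details. Notice that we have not yet used the fact that $f$ has a compact support.

The compact support assumption allows an extension of Lemma 2.2 to $k = K$, namely the inequality holds true for $t$ in $]t_{K+1}, t_K] = ]0, t_K]$ as soon as we let $\lambda(t_{K+1}) := \mathrm{Leb}(\mathrm{supp} f)$. Indeed the compactness of $\mathrm{supp} f$ implies that $\lambda(t) \to \mathrm{Leb}(\mathrm{supp} f)$ as $t \to 0$. Observing that Lemma 2.1 already contains the case $k = K$, this leads to, for $k$ in $\{1, \dots, K\}$ and $t \in ]t_{k+1}, t_k]$, $|EM^*(t) - EM_{s_K}(t)| \le 2\Phi_n(\delta) + \lambda(t_{k+1})(t_k - t_{k+1})$. Therefore, $\lambda$ being a decreasing function bounded by $\mathrm{Leb}(\mathrm{supp} f)$, we obtain the following: with probability at least $1 - \delta$, we have for all $t$ in $]0, t_1]$:

$$|EM^*(t) - EM_{s_K}(t)| \le \left(A + \sqrt{2\log(1/\delta)}\right)\frac{1}{\sqrt{n}} + \mathrm{Leb}(\mathrm{supp} f) \sup_{1 \le k \le K} (t_k - t_{k+1}).$$


Proof of Theorem ??

The first part of this theorem is a consequence of (??) combined with:

$$\sup_{t \in ]0, t_N]} |EM^*(t) - EM_{s_N}(t)| \le 1 - EM_{s_N}(t_N) \le 1 - EM^*(t_N) + 2\Phi_n(\delta),$$

where we use the fact that $0 \le EM^*(t_N) - EM_{s_N}(t_N) \le 2\Phi_n(\delta)$ following from Lemma 2.1.

To see the convergence of $s_N(x)$, note that the increments $t_k - t_{k+1}$ are non-negative and summable:

$$\sum_{k=1}^{\infty} (t_k - t_{k+1})\,\mathbb{1}_{x \in \hat\Omega_{t_k}} \le \sum_{k=1}^{\infty} (t_k - t_{k+1}) \le t_1 < \infty,$$

and analogously to Remark ?? observe that $EM_{s_N} \le EM_{s_\infty}$ so that $\sup_{t \in ]0, t_1]} |EM^*(t) - EM_{s_\infty}(t)| \le \sup_{t \in ]0, t_1]} |EM^*(t) - EM_{s_N}(t)|$, which proves the last part of the theorem.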
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Proof of Lemma ?? 


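By definition, for every class of sets $\mathcal{H}$, $EM^*_{\mathcal{H}}(t) = \max_{\Omega \in \mathcal{H}} H_t(\Omega)$. The bias $EM^*(t) - EM^*_{\mathcal{G}}(t)$ of the model $\mathcal{G}$ is bounded above by $EM^*(t) - EM^*_{\mathcal{F}}(t)$ since $\mathcal{F} \subset \mathcal{G}$. Remember that $f_F(x) := \sum_{i \ge 1} \mathbb{1}_{x \in F_i} \frac{1}{\mathrm{Leb}(F_i)} \int_{F_i} f(y)\,dy$ and note that for all $t > 0$, $\{f_F > t\} \in \mathcal{F}$. It follows that:

$$\begin{aligned}
EM^*(t) - EM^*_{\mathcal{F}}(t) &= \int_{f > t} (f - t) - \sup_{C \in \mathcal{F}} \int_C (f - t) \\
&\le \int_{f > t} (f - t) - \int_{f_F > t} (f - t) \quad \text{since } \{f_F > t\} \in \mathcal{F} \\
&= \int_{f > t} (f - t) - \int_{f_F > t} (f_F - t) \quad \text{since } \forall G \in \mathcal{F},\ \int_G f = \int_G f_F \\
&= \int_{f > t} (f - t) - \int_{f > t} (f_F - t) + \int_{f > t} (f_F - t) - \int_{f_F > t} (f_F - t) \\
&= \int_{f > t} (f - f_F) + \int_{\{f > t\} \setminus \{f_F > t\}} (f_F - t) - \int_{\{f_F > t\} \setminus \{f > t\}} (f_F - t).
\end{aligned}$$

Observe that the second and the third term in the bound are non-positive. Therefore:

$$EM^*(t) - EM^*_{\mathcal{F}}(t) \le \int_{f > t} (f - f_F) \le \int_{\mathbb{R}^d} |f - f_F|.$$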
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